@pepijnve
Contributor

@pepijnve pepijnve commented Oct 31, 2025

Which issue does this PR close?

Rationale for this change

The algorithms suggested in this PR originate from the case logic in DataFusion (see datafusion#18152 and datafusion#18444). I think it might be useful to move them to arrow-rs instead of being tucked away in a corner of the DataFusion codebase.

What changes are included in this PR?

Adds two-way and n-way merge algorithms that sit halfway between zip and interleave. In contrast to zip, the truthy and falsy arrays do not need to be prealigned. In contrast to interleave, the relative order of elements in each input array is retained in the final result.
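
To make the difference with zip concrete, here is a minimal sketch on plain slices rather than Arrow arrays. The names `zip_sketch` and `merge_sketch` are hypothetical and are not part of this PR; the sketch only illustrates the selection rule described above, not the kernel's actual implementation or its null handling.

```rust
/// Position-based selection, the way zip behaves: all three inputs have the
/// same length and output element `i` comes from index `i` of one input.
fn zip_sketch<T: Clone>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Vec<T> {
    mask.iter()
        .enumerate()
        .map(|(i, &m)| if m { truthy[i].clone() } else { falsy[i].clone() })
        .collect()
}

/// Cursor-based selection, as the description characterises merge: each
/// input is consumed front to back, so it holds only one element per
/// matching mask position and keeps its own relative order.
fn merge_sketch<T: Clone>(mask: &[bool], truthy: &[T], falsy: &[T]) -> Vec<T> {
    let mut t = truthy.iter();
    let mut f = falsy.iter();
    mask.iter()
        .map(|&m| {
            let next = if m { t.next() } else { f.next() };
            next.expect("input is shorter than the mask requires").clone()
        })
        .collect()
}

fn main() {
    let mask = [true, false, false, true];
    // zip needs prealigned inputs of length 4 ("-" marks unused slots).
    let zipped = zip_sketch(&mask, &["A", "-", "-", "B"], &["-", "x", "y", "-"]);
    // merge takes the compact inputs directly.
    let merged = merge_sketch(&mask, &["A", "B"], &["x", "y"]);
    assert_eq!(zipped, merged);
}
```

The compact inputs are what a caller typically has after evaluating each branch only on the rows that selected it, which is why no prealignment step is needed.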

Are these changes tested?

I've added two minimal unit tests; more should probably be added.

Are there any user-facing changes?

No breaking API changes

@github-actions github-actions bot added the arrow (Changes to the arrow crate) label Oct 31, 2025
@pepijnve
Contributor Author

pepijnve commented Oct 31, 2025

The optimisation work that was done in #8653 would make sense here as well. That has not been done yet.

Contributor

@alamb alamb left a comment

Thanks @pepijnve -- what do you think about also adding benchmarks to this kernel (so that future optimizations work better)?

@pepijnve
Contributor Author

> what do you think about also adding benchmarks to this kernel

Good idea. I’m happy to continue working on this one. I created the PR already to get the ball rolling and solicit input from other devs.

@pepijnve
Contributor Author

pepijnve commented Nov 2, 2025

> The optimisation work that was done in #8653 would make sense here as well. That has not been done yet.

While looking into this, I realised that merge on scalars is effectively identical to zip, so I resolved this by delegating to zip when an input is a scalar.
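
A small sketch of why the two coincide for scalars, again on plain values rather than Arrow arrays. `zip_scalars` and `merge_scalars` are hypothetical helpers written for this illustration, and the sketch assumes both inputs are scalars: a repeated value looks the same whether it is read by position or by cursor.

```rust
/// zip view: each scalar stands for a full-length, prealigned array.
fn zip_scalars(mask: &[bool], t: i32, f: i32) -> Vec<i32> {
    mask.iter().map(|&m| if m { t } else { f }).collect()
}

/// merge view: each scalar stands for a compact array with one entry per
/// matching mask position, consumed front to back.
fn merge_scalars(mask: &[bool], t: i32, f: i32) -> Vec<i32> {
    let n_true = mask.iter().filter(|&&m| m).count();
    let truthy = vec![t; n_true];
    let falsy = vec![f; mask.len() - n_true];
    let (mut ti, mut fi) = (truthy.iter(), falsy.iter());
    mask.iter()
        .map(|&m| *(if m { ti.next() } else { fi.next() }).unwrap())
        .collect()
}

fn main() {
    let mask = [true, false, true, true, false];
    // Every element of a repeated value is the same, so the read position
    // cannot change the result.
    assert_eq!(zip_scalars(&mask, 1, 0), merge_scalars(&mask, 1, 0));
}
```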

@pepijnve
Contributor Author

pepijnve commented Nov 2, 2025

> what do you think about also adding benchmarks to this kernel

@alamb I duplicated the microbenchmark for zip as a quick fix. Is it worth trying to actually share the sets of input data and masks? If so, where should I move that code?

///
/// ```
pub fn merge_n(values: &[&dyn Array], indices: &[impl MergeIndex]) -> Result<ArrayRef, ArrowError> {
    let data_type = values[0].data_type();
Member

There is no check for an empty values array, so values[0] panics when values is empty.

Contributor Author

Check added along with unit tests
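
For illustration, one common way to guard against this is to replace the direct index with `first()`. This is a sketch only: `InvalidArgument`, `FakeArray`, and `first_data_type` are hypothetical stand-ins, and the thread does not show how the PR itself handles the empty case.

```rust
/// Hypothetical error type standing in for the crate's real error.
#[derive(Debug, PartialEq)]
struct InvalidArgument(String);

/// Hypothetical stand-in for an array; only its data type matters here.
struct FakeArray {
    data_type: &'static str,
}

fn first_data_type(values: &[FakeArray]) -> Result<&'static str, InvalidArgument> {
    // `values[0]` panics on an empty slice; `first()` yields `None` instead,
    // which can be turned into an error for the caller.
    let first = values
        .first()
        .ok_or_else(|| InvalidArgument("at least one input array is required".to_string()))?;
    Ok(first.data_type)
}

fn main() {
    assert!(first_data_type(&[]).is_err());
    let values = [FakeArray { data_type: "Int32" }];
    assert_eq!(first_data_type(&values), Ok("Int32"));
}
```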

let falsy = falsy_array.to_data();
let truthy = truthy_array.to_data();

let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len());
Member

Suggested change
- let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, truthy.len());
+ let mut mutable = MutableArrayData::new(vec![&truthy, &falsy], false, mask.len());

Contributor Author

Fixed
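
The reasoning behind the suggestion: the third argument of `MutableArrayData::new` is a capacity, and a merge produces one output element per mask position, so `mask.len()` is the exact output size, while `truthy.len()` only counts the positions where the mask is true. A plain-slice sketch of the same sizing follows; `merge_presized` is a hypothetical name, not code from this PR.

```rust
fn merge_presized(mask: &[bool], truthy: &[i32], falsy: &[i32]) -> Vec<i32> {
    // One output element per mask position, so mask.len() is the exact size.
    let mut out = Vec::with_capacity(mask.len());
    let (mut t, mut f) = (0, 0);
    for &m in mask {
        if m {
            out.push(truthy[t]);
            t += 1;
        } else {
            out.push(falsy[f]);
            f += 1;
        }
    }
    out
}

fn main() {
    let mask = [false, true, false, false];
    let truthy = [10];
    let out = merge_presized(&mask, &truthy, &[1, 2, 3]);
    // The output has mask.len() elements; sizing by truthy.len() would
    // reserve room for only one of them.
    assert_eq!(out.len(), mask.len());
    assert_ne!(out.len(), truthy.len());
}
```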

/// Long spans of null values are also especially cheap because they do not need to be represented
/// in an input array.
///
/// # Safety
Member

Suggested change
- /// # Safety
+ /// # Panics

Contributor Author

Fixed

use arrow_data::transform::MutableArrayData;
use arrow_schema::ArrowError;

/// An index for the [merge] function.
Member

Suggested change
- /// An index for the [merge] function.
+ /// An index for the [merge_n] function.
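
As a rough picture of what such an index does in an n-way merge, here is a sketch on plain slices. The definition of `MergeIndex` is not shown in this thread, so the sketch substitutes `Option<usize>` as an assumption: `Some(i)` takes the next unread element of input `i`, and `None` emits a null, which matches the doc comment's point that null spans need no backing input array. `merge_n_sketch` is a hypothetical name.

```rust
/// Assumption: `Option<usize>` stands in for the PR's index type.
/// Panics if an index is out of range or an input runs out of elements.
fn merge_n_sketch<T: Clone>(values: &[&[T]], indices: &[Option<usize>]) -> Vec<Option<T>> {
    // One read cursor per input keeps each input's relative order intact.
    let mut cursors = vec![0usize; values.len()];
    indices
        .iter()
        .map(|ix| {
            ix.map(|i| {
                let v = values[i][cursors[i]].clone();
                cursors[i] += 1;
                v
            })
        })
        .collect()
}

fn main() {
    let a = ["a1", "a2"];
    let b = ["b1"];
    let out = merge_n_sketch(&[&a[..], &b[..]], &[Some(0), None, Some(1), Some(0)]);
    assert_eq!(out, vec![Some("a1"), None, Some("b1"), Some("a2")]);
}
```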

&mut group,
&masks,
&array_1_10pct_nulls,
&non_null_scalar_1,
Member

The arguments here look exactly the same as for array_vs_non_null_scalar above. I think the last two arguments should be swapped.

Contributor Author

Indeed. I had copied these from zip_kernel.rs, which has the same mistake. Fixing it here and in zip_kernel.rs.

@pepijnve
Contributor Author

There's a failing test case, but I don't see how it relates to this change.

Development

Successfully merging this pull request may close these issues.

Provide algorithm that allows zipping arrays whose values are not prealigned